Introduction
Project Motivation
Neighborhoods with a high queer population, or Gayborhoods, have been urban areas in which LGBTQ+ people have found community and built culture worldwide. These geographic areas serve as much more than the sexuality of their constituents, and have been cited as yielding robust creative economies, as well as a welcoming environment for those of many identities. However, because queer identities are not directly measured by the Census and can often be hidden by a variety of discriminatory values, mapping Gayborhood status can be difficult1. Knowing this, the identity of a certain locale as a Gayborhood becomes a crucial sociological metric, with neighborhoods with a more prevalent queer identity driving social liberalism in the face of prejudice. Our analysis thus builds upon previous methods for classifying gayborhood status, first introduced by Castells and Murphy in 19822, with the creation of a location-based regression model that uses a variety of parameters including housing, land use, and non-queer demographic data to predict Gayborhood degree. In so doing, we will be able to determine whether an area is suitable for queer folks with the goal of advancing understanding of liberal areas and their interaction with daily life.
Workflow Overview
Our initial modeling process consisted of initial data collection (Section 2), exploratory analysis and feature engineering (Section 3.3), and modeling with v-fold cross validation (Section 4.1). This was followed by creation of an intelligent ensemble model (Section 4.3) and subsequent assessment of our ensemble on data that was withheld from our training set (Section 5). We assessed models mostly based on root mean squared error, or RMSE, as this is a common and topical metric for regression. We’d like to penalize predicted values which largely deviate from actual values due to potential harmful consequences of incorrect predictions. However, we also include the concordance correlation coefficient, or CCC, to assess the amount of variation in our outcome variable captured by our model as opposed to the traditional \(R^2\) since we’d like to know more about the fidelity of the model predictions to actual values. This process is represented below, in figure 1:
Figure 1: Diagram of the initial modeling process
Outcome Variable
In order to rank Gayborhood status, we are using a published dataset from data world3 that was analyzed to display distribution of queer communities in Jan Diehm’s 2018 article Men are from Chelsea, Women are from Park Slope. Specifically, the metrics in this dataset are combined to form a Gayborhood index, which we will be using a modified version of as the supervising variable for our machine learning project. This index is a holistic assessment of queerness, derived primarily from the following measures:
- Same-sex joint tax filers
- Unmarried partner same sex households
- Number of Gay Bars
- Whether or not a pride parade routes through the region
In this case, we will be training a regression model to predict Gayborhood index. The initial variable exists on a 0 to 100 scale with a mean of 2.39, median of 1.27, and a max of 67.1. In Section 3, we will discuss our measures to reprocess the outcome variable.
Data collection
Because this project is based primarily on data by ZCTA, a US Census-defined zip code proxy, we are able to pool data from two main sources – the US Census and Open ICPSR– to encompass 4 major facets of urban life.
- Our first predictor set4 comes from the US Census, and provides a variety of information about housing characteristics including number of housing units in an area, rent prices, and indicators of development (phone reception, income to rent ratio, etc.). This is derived from the Census’s ACS 5-year survey estimates for the year 2015. We use data from 2015 since that is also when the Gayborhood dataset was published. The data are relevant to our prediction problem of interest as housing characteristics may imply information about how affordable and lively local economies are, as urban areas tend to attract more LGBTQ+ identifying individuals.
- We also used US Census data to glean information about various demographic characteristics, as represented in our second data section5. This set primarily synthesizes demographic information that is not specifically related to sexuality (i.e. the parameters that went into the calculation of our outcome variable), but describe other characteristics of each area. As mentioned before, having these data might reveal how welcoming a city is based on diversity, while avoiding issues of multicolinearity.
- We also opted to include data from Open ICPSR, which included information about parks6 in each geographical region, including number of open parks and proportion of ZCTA area that is park space. Since parks are public areas, they serve as a predictive metric for the culture of a community, and are thus important in describing neighborhood lifestyle.
- Finally, we included data concerning the primary method of transport to work7, also from the US Census ACS survey. This will round out our predictors by describing the movement, as well as the physical locale, within a neighborhood, and may be useful to understand the structure of the area (urban/suburban/rural/other).
These yield a functional dataset with \(> 550\) predictors, which fall into the above categories. For example, these variables included:
- Estimate of households that made less than $10,000
- Estimate of households with 1 vehicle available
- Estimate of wholesale workers who carpooled with a car, truck, or van
- Count of open parks
- Estimate of residents aged 20 to 24
Issues and Challenges
Collecting and cleaning the data was fairly easy. The census data is well organized, and generally has low missingness, which allowed us to only throw out a few potential predictors and will allow us to impute the rest. For cleaning of census data, we also removed columns that contained margins of error and annotations rather than the actual variables. The parks data required the removal of some text form within the data columns and conversion to numeric format. After cleaning individually, all datasets were then combined into a single dataset.
One main concern with our data is the ratio of observations to predictors, since the ratio is currently ~ 4:1, which will drastically inflate the variance of predictions, counteracting our goal of accurately pinning down the index measure. Because of this undesirable ratio, most of our feature engineering efforts will rely on targeted dimensionality reduction in order to explain the variance captured in these variables without overfitting to the predictor variables.
Because census predictors were numerous, any variables with greater than 10% missing were removed. The missingness in any remaining variables is assumed to be missing completely at random (MCAR) since these parameters were collected and reported in the US Census ACS survey, which aims to report a nationally representative population.
Exploratory Data Analysis
Outcome Variable Transformation
To begin our exploratory analysis, we assessed methods for transformation of our outcome variable, an index measuring how queer a ZCTA region was based on parameters defined in this dataset from data world8. Based on the visualization below, a transformation should be applied to the outcome variable as it has a right skew, which will hinder the modeling process. Additionally, notice that the boxplot denotes many outliers within the data, which will lead to poor fits for models that can only generalize to common data. This report hopes to re-frame this index so our measure will appear to have data generated from a normal distribution.
Figure 2: Examination of Outcome Variable Skewness.
We originally planned to apply a log10 transformation to the outcome variable. However, the appearance of success from the log transformation was based off removal of 0 values, which are undefined (mapped to negative infinity). Examining the original data more closely, we find the following observations:
- Calculations for
totindex(the index measure) are not easily decipherable, and for a few ZCTAs, seem to be wrong entirely - There exist 0 values when some of the key queerness measures (e.g. number of same sex couples joint filing taxes) are positive
- Some zip codes that are included in the data have no documented inhabitants (like 20535 and 94128)
In attempts to remedy these issues, we first dropped any zip codes with \(<100\) inhabitants, and then recalculated our outcome variable, based on the following formula:
\[gayborhood\_index = 40*ss\_tax\_index + 40*ss\_live\_index + 10*parade + 10*bars\] where
- \(ss\_tax\_index\) is a measure of what percent of couples filing jointly are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(ss\_live\_index\) is a measure of what percent of unmarried couples living together are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(parade\) is an indicator variable indicating if a pride parade passes through that zip code (0 or 1)
- \(bars\) is a measure of how many gay bars there are in that area (normalized by the maximum value, 20, on a 0 to 1 scale)
This calculation accurately repurposes the original dataset for our use, which is not as concerned with queer identity across gender identities as was the original article Men are from Chelsea, Women are from Park Slope.
We then used a Yeo-Johnson transformation with \(\lambda = -0.268\) (optimized by the applicable step_YeoJohnson() in tidymodels) to transform our data and improve normality. This transformation was chosen as it allows us to keep our 0 values, which have an important physical representation in this dataset, without greatly compromising normality:
Figure 3: Yeo-Johnson transform improves normality of target distribution.
Now, the data appears to be generated from a normal distribution because of the bell-curve shape and the lack of outliers, which will improve model performance moving forward.
Predictor Variable Selection
The provided data listed over 550 possible predictor variables. Fitting models to all potential predictor variables would not only be computationally expensive, but increase the variance of the predictions. Because of these issues, this report aims to mechanically select the most relevant predictors for the regression problem.
A LASSO model was fit to the training set to further narrow down the list of predictors by penalizing the addition of terms to a regression model. A lenient penalty term of 0.01 was employed, along with steps to impute missing values with K-Nearest Neighbors and remove variables with near zero variance, which would have provided little help in the prediction problem. Ultimately, about 100 predictors remained to be used in the model fitting process.
Feature Engineering
In addition to the issue of having many possible predictor variables, there are varying degrees of missingness across the predictors. Many of the variables with larger degrees of missingness were removed in the process of LASSO selection, and the remaining variables had low missingness (under 2%). The remaining missing values are summarized in the table below, with higher missingness listed first.
Table 1: Missingness in the predictor set.
The missingness for the rest of the models was also resolved by imputing the variables with values from a K-Nearest Neighbors model, selected for the balance of quick computational times and flexible non-linear methods based on its non-parametric underpinnings.
Our model was based relationships between the Gayborhood index and all of the remaining predictors after LASSO selection. Additional feature engineering steps include the following:
- removal of all near zero variance predictors (though this is ideally redundant after LASSO selection)
- updating the role of the ID column to ensure it is not used as a predictor,
- dummy encoding categorical predictors,
- normalization of all predictors (centered and scaled to unit variance)
This feature engineering workflow was decided on after initial tinkering with the elastic net model as a base for understanding our data. In order to increase the amount of variation explained by this model, we included a version that included all pairwise interactions between the feature selected terms, and then proceeded to use partial least squares tuned to the outcome to reduce dimensionality. Number of components used in this model was tuned, yet it still performed worse than the base elastic net model, likely because the penalty term proxies for targeted dimensionality reduction.
Data Splitting and Folding
We split the data into training and testing sets with a proportion of 80% training and 20% testing, stratified by the outcome variable with 4 bins. We chose this as it is the convention for datasets that are large enough that withholding data from the training set is not a concern.
We used v-fold cross validation to tune hyperparameters while preventing overfitting to the training data. We used 5 folds with 3 repeats as opposed to larger numbers to reduce model runtimes.
Modeling Process
Our initial modeling process began with fitting our data to a variety of different supervised ML model types, in order to determine which one would best be able to predict new data. For these model types (with the exception of neural networks, random forest, and MARS), Bayesian iteration was performed to optimize hyperparameter combinations by testing new values against prior expectations generated by an initial grid tune.
Initial Model Types
The following model types were used for initial analysis:
- Penalized Elastic Net Regression
- with and without partial least squares
glmnetengine
- Random Forest
rangerengine
- Boosted Tree
xgboostengine
- K-Nearest Neighbors
kknnengine
- Neural Networks
- Using
kerasand Tensorflow from python through R - Using
nnetin R
- Using
- Multivariate Adaptive Regression Splines (MARS)
earthengine
- Cubist ensemble regression
cubistengine
Model Results
Based on resample cross validation, initial results showed that the cubist ensemble model performed the best, with a mean RMSE of .328. Elastic net regression (without PLS) and random forest were also in the top 3 models, all with RMSEs under .34. K-nearest neighbors and boosted trees performed almost as well, both under .35. There is then a significant gap to the worse models, those being elastic net with PLS, multivariate adaptive regression splines, and neural networks with keras. This is summarized in table 2:
| Initial Model Comparison | |||
| Model Type | Best Mean RMSE | Standard Error | Optimal Hyperparameters |
|---|---|---|---|
| Cubist Ensemble Regression (BI) | 0.328 | 0.004 | Committees = 88, Max Rules = 290, Neighbors = 0 |
| Elastic Net (BI) | 0.337 | 0.004 | Penalty = 1.1e-08, Mixture = 0.051 |
| Random Forest (BI) | 0.339 | 0.003 | Mtry = 91, Trees = 1900, Min N = 3 |
| K-Nearest Neighbors (BI) | 0.343 | 0.003 | Neighbors = 22, Dist Power = 1 |
| Boosted Trees (BI) | 0.348 | 0.003 | Min N = 38, Learn Rate = 0.25, Loss Reduction = 1.2e-09 |
| Elastic Net (PLS) (BI) | 0.370 | 0.003 | Penalty = 0.0032, Mixture = 1 |
| Multivariate Adaptive Regression Splines (BI) | 0.376 | 0.003 | Num Terms = 5, Prod Degree = 1 |
| Neural Network (nnet) | 0.379 | 0.003 | Hidden Units = 10, Penalty = 1.7e-08, Epochs = 23 |
| Neural Network (Keras) | 0.398 | 0.004 | Penalty = 1, Hidden Units = 5 |
| Null Model | 0.503 | 0.002 | NA |
Table 2: Initial model results. RMSEs are averaged across folds, and hyperparameter combinations for each model with the lowest mean RMSE are shown.
In order to gauge the computational efficiency of these calculations, we also display the amount of time it took to fit each of these model types, both as time per model in the initial tune (in milliseconds), and time per iteration of Bayesian optimization (in seconds):
| Initial Model Comparison | ||
| Model Type | Average Time per Model (ms) | Time per Bayesian Iteration (s) |
|---|---|---|
| Elastic Net | 30.02 | 0.12 |
| Boosted Trees | 56.66 | 0.45 |
| Elastic Net (PLS + interactions) | 90.49 | 16.97 |
| K-Nearest Neighbors | 208.81 | 1.88 |
| MARS | 280.78 | NA |
| Cubist Ensemble Regression | 773.60 | 31.33 |
| Random Forest | 1637.52 | NA |
| Neural Network (nnet) | 17516.00 | NA |
| Neural Network (Keras) | 18231.31 | 937.61 |
Table 3: Initial model time results. Time per model accounts for number of hyperparameter combinations, number of folds, and number of repeats. Models were trained using 5-8 threads, except for keras models, which were trained in series.
Unsurprisingly, elastic net models fit the quickest to our data. It is important to note here that parallel processing was not possible for the neural network case, which partially explains their much higher values. Here, it is also important to note that cubist ensemble regression, which had the lowest RMSE, did take longer than most of our other models. It is still within a reasonable range for this set, but in similar studies elastic net could be a more computationally efficient choice.
Ensemble Model
Initially, our plan was to fit an ensemble model that combined the top 3 performing initial models, which were:
- cubist ensemble regression
- elastic net regression (no interaction/PLS)
- random forest
A peek at the initial contributors revealed that 6 configurations were cubist ensemble models, 2 configurations were elastic net, and 2 configurations were random forest. However, we note that Figure 2 highlights RMSE values for K-nearest neighbors and Boosted trees that were very similar to our selected three. For this reason, we also fit an ensemble model with all 5 of these model types as inputs. Initial results from these two fits are displayed below:
Figure 4: Comparison of different Ensemble model options. 3 inputs maximizes RMSE.
Here, a look at the top contributors to the 5 input ensemble reveals an interesting trend:
| Member Weights of 5 Input Ensemble | |
| type | weight |
|---|---|
| Random Forest | 0.1930 |
| Cubist Ensemble | 0.1740 |
| Elastic Net | 0.1740 |
| Cubist Ensemble | 0.1540 |
| KNN | 0.0587 |
| Cubist Ensemble | 0.0551 |
| KNN | 0.0484 |
| Boosted Tree | 0.0484 |
| KNN | 0.0406 |
| Boosted Tree | 0.0371 |
Table 4: Weights applied to individual models in the 5 type input stack. Model types in the initial 3 input stack are highlighted in green.
Our 3 model types from the first ensemble are the top 4 contributors, and comprise 76% of this second ensemble.
Additionally the 3 input model was seen to have a lower RMSE at the chosen number of members, and because this is also the simplest (and therefore least likely to overfit) model, we proceed with it as our final model.
Final Results
Fitting our final ensemble model to the testing data and undoing the Yeo-Johnson transform, we find the root mean squared error of predictions was 3.30, and the concordance correlation coefficient was 0.714. In other words, the average miss of the model was 3.30%, which is a welcome sight as our index is scored out of 100 percentage points. This is thus a valuable model for predicting queer acceptance in areas based on the way life is organized in those locations. Interestingly, this RMSE is higher than two of the ensemble members (both cubist ensemble models) individually when fit to the testing data, most likely due to issues with overfitting, as shown in Table 5:
| Ensemble Results | |
| Model Name | RMSE |
|---|---|
| cer_bayes_1_23 | 3.196 |
| cer_bayes_1_24 | 3.216 |
| ensemble | 3.304 |
| cer_bayesIter9 | 3.318 |
| cer_bayes_1_09 | 3.376 |
| cer_bayes_1_06 | 3.378 |
| cer_bayes_1_05 | 3.414 |
| en_bayes_1_1 | 3.466 |
| en_bayesIter3 | 3.466 |
| rf_bayesIter1 | 3.673 |
| rf_bayes_1_3 | 4.326 |
Table 5: Ensemble results by member. Ensemble refers to the entire ensemble performance.
In order to ensure that this discrepancy between our expected lowest RMSE (our ensemble) and our actual lowest RMSE (a cubist constituent) really exists, we fit a cubist ensemble model with the best performing hyperparameters on our resamples. This yields an RMSE of 3.32, which is higher than our ensemble model. Thus, we conclude that the two specific cubist models that performed better than our ensemble happened to be better at prediction in our testing data, and we choose not to further use these models outside of the ensemble in order to avoid overfitting to our testing data.
To further analyze our results, we plot predicted vs true values of our index, as well as fitted vs residual values:
Figure 5: Assessment of fit of final ensemble model. Model aptly captures variance in outcome variable, especially for lower values.
These two plots highlight the general success of our model– predictions correlate with actual values and are relatively faithful to the 45 degree line, and there is no clear trend in residuals that our model fails to capture. That said, it is important to note that this model did a much better job of fitting to lower Gayborhood index values, which were more plentiful in both the training and testing data.
Conclusions
For this reason, we conclude that our model is a valuable tool for predicting areas that are likely to be less-LGBTQ friendly, with less direct predictive strength over areas with a high proportion of queer residents. This proves the predictive strength of our input data, which concerned general information concerning day-to-day life in a region, in the determination of LGBTQ friendly areas. Sociologically, this model could thus become a valuable way to gauge the LGBTQ+ engagement in areas based on parameters that describe their general way of life, enabling folks everywhere to live with pride.
Future Directions
Future studies resembling this one could explore more feature engineering steps– our conclusions are generally based on the simple feature engineering that we landed on as a valuable and computationally efficient way to capture variance with our models, but it is possible that other methods could also aid in this process. For example, further extraction of data features through natural splines, b-splines, convex splines, and monotone splines with varying degrees of freedom could prove useful in prediction. Oftentimes, the data is nonlinear, so implementing these steps may improve model performance at the risk of overfitting.
Other potential studies may look into other models not included within the
tidymodelsframework. For example, boosted trees are gaining traction as methods to improve model performance and decrease computational times continue to be developed. For example, usingcatboost(gradient boosting) orLightGBMmay be useful as currently,XGBoostis the main method of implementing a boosted tree. Other methods like automatic feature selection throughfeaturetoolscould also be worth testing to see if further improvements can be made to increase predictive power of models and data.Additionally, in similar studies we would be interested in determining if decreasing the ensemble model input to 2 members would help, before fitting to our testing data. This insight comes after fitting to test data, where we notice that our two random forest contributors had much higher RMSE values than the other candidate models. However, we avoid performing this ensemble reduction on our data, since we do not want knowledge of the test data to bias our model building.
Besides feature engineering and modeling steps, we also acknowledge our data sources are imperfect and most likely do not capture the entire story here. Even though our project is not inferential in nature, we are still worried about factors including measurement error of some qualitative variables and omitted variable bias. These concerns may be remedied with additional background research and literature review into current methodologies surrounding the study of Gayborhoods.
Footnotes
Ghaziani, A. (2014). APPENDIX: WHAT ARE GAYBORHOODS? HOW DO WE STUDY THEM? In There Goes the Gayborhood? (pp. 271–286). Princeton University Press. http://www.jstor.org/stable/j.ctt6wq0cv.13↩︎
Castells, M., & Murphy, K. (1982). Cultural identity and urban structure: the spatial organization of San Francisco’s gay community. Urban policy under capitalism, 20, 237-259.↩︎
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎
US Census. (2021). DP04: SELECTED HOUSING CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=rent&g=010XX00US$8600000&tid=ACSDP5Y2021.DP04. Accessed 24 April 2023.↩︎
US Census. (2015). DP05: ACS DEMOGRAPHIC AND HOUSING ESTIMATES [Data set]. data.census.gov. https://data.census.gov/table?g=010XX00US$8600000&tid=ACSDP5Y2015.DP05. Accessed 26 April 2023.↩︎
Li, Mao, Melendez, Robert, Khan, Anam, Gomez-Lopez, Iris, Clarke, Philippa, and Chenoweth, Megan. (2020). National Neighborhood Data Archive (NaNDA): Parks by ZIP Code Tabulation Area, United States, 2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/E119803V1. Accessed 23 April 2023↩︎
US Census. (2015). S0802: MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=commuting&g=010XX00US$8600000&tid=ACSST5Y2021.S0802. Accessed 30 April 2023.↩︎
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎